Abstract

Several efforts have been made to bring ROC analysis beyond (binary) classification, especially in regression.
However, the mapping and possibilities of these proposals do not correspond to what we expect from the analysis of
operating conditions, dominance, hybrid methods, etc. In this paper we present a new representation of regression
models in the so-called regression ROC (RROC) space. The basic idea is to represent over-estimation on the x-axis
and under-estimation on the y-axis. The curves are just drawn by adjusting a shift, a constant that is added to
(or subtracted from) the predictions and plays a role similar to that of a threshold in classification. From here, we develop
the notions of optimal operating condition, convexity and dominance, and explore several evaluation metrics that can be
shown graphically, such as the area over the RROC curve (AOC). In particular, we show a novel and significant result:
the AOC is equal to the error variance (multiplied by a factor which does not depend on the model). The derivation of
RROC curves with non-constant shifts and soft regression models, and the relation with cost plots, are also discussed.

Keywords: ROC Curves, Asymmetric loss, Regression, Error variance, MSE decomposition 
1 Motivation



In classification, the traditional notion of operating condition is common and well understood. Classifiers may be 
trained for one cost proportion and class distribution (both making the operating condition) and then deployed on a 
different operating condition. Some of the techniques and notions for addressing these cases are cost matrices, cost-sensitive
classification [10] and, very especially, ROC analysis [39, 48, 5, 49, 21, 38, 11, 37]. ROC space decomposes
the performance of a classifier in a dual way. On the x-axis we show the false positive rate (FPR) and on the y-axis
we show the true positive rate (TPR). ROC curves neatly visualise how the TPR and the FPR change for different 
(crisp) classifiers or evolve for the same (soft) classifier (or ranker) for a range of thresholds. The notion of threshold 
is the fundamental idea to adapt a soft classifier to an operating condition. ROC analysis is the tool that illustrates 
(among other things) how classifiers and threshold choices perform. The number and variety of applications and 
areas (radiology, medicine, statistics, bioinformatics, machine learning, pattern recognition, to name a few) have been 
increasing over the years [23, 40, 34, 35]. Also, some metrics derived from the ROC curve, such as the Area Under 
the ROC Curve (AUC), are now key for the evaluation and construction of classifiers [12, 41, 51, 27, 44, 36].

The adaptation of ROC analysis for regression has been attempted on many occasions. However, there is no such
thing as the 'canonical' adaptation of ROC analysis in regression, since regression and classification are different
tasks, and the notion of operating condition may be completely different. In fact, the mere extension of ROC analysis 
to more than two classes has always been difficult because the degrees of freedom grow quadratically with the number 
of classes (see, e.g., [47, 18, 46]). The inclusion of probabilities (and other magnitudes) in ROC curves or the use for 
abstaining classifiers [16, 42, 13] has not paved the way on how to do similar things for regression. Consequently it 
is even questionable whether a similar graphical representation of ROC curves in regression (or other tasks [29]) can 
even be figured out. Notable efforts towards ROC curves (or graphical tools) for regression are the Regression Error 
Curves (REC) [4], the Regression Error Characteristic Surfaces (RECS) [52], the notion of utility-based regression 
[53] and the definition of ranking measures [45]. These approaches are based on gauging the tolerance, rejection 
rules or confidence levels. Some of these approaches actually convert a regression problem into a classification problem
(tolerable estimation vs. intolerable estimation). Another recent approach has been based on the calculation of
Kendall's rank correlation coefficient $\tau$ between the predicted and actual values [14], thus disregarding the magnitudes.
However, none of these previous approaches started from a notion of 'operating condition', related to an asymmetric 
loss function. Also, the notion of threshold was not replaced by a similar concept playing its role for adjusting to the 
operating condition, and the dual positive-negative character in ROC analysis was blurred. 

In this paper we present a graphical representation of regression performance based on a very common view of
operating condition in regression. Many regression applications have deployment contexts where over-estimations
are not as costly as under-estimations (or vice versa). This is called loss asymmetry. Loss asymmetry is just
one kind of operating condition (or one of its constituents), but a very important one in many applications.

The ROC space for regression is then defined by placing the total over-estimation on the x-axis and the total under-estimation
on the y-axis. This duality leads to regions and isometrics in the ROC space where over-estimations have
less cost than under-estimations and vice versa, and we can plot different regression models to see the notions of 
dominance. We also consider the construction of hybrid regressors. The plot leads to curves when we use the notion 
of shift, which is just a constant that we can add (or subtract) to example predictions in order to adjust the model to 
an operating condition. This notion is parallel to the notion of threshold in classification. Interestingly, while we can 
derive the best shift for a dataset given an existing model (which boils down to shifting it to make its average error equal
to zero), there are some effective methods to determine this shift for the deployment data given an operating condition,
as has been recently explored by [1, 56]. Also, there are some other ways to make this shift dependent on each example
[28]. All this leads to a more meaningful interpretation of what the ROC curves in regression mean, and what their
areas represent. This will also be explored in this paper. 

The paper is organised as follows. Section 2 introduces some notation, the problem of context-sensitive evaluation 
and the use of asymmetric costs in regression. The RROC space is introduced in section 3, where we represent several 
regression models as points, derive the isometrics of the space and develop the notions of hybrid models, dominance 
and convex hull. Section 4 introduces RROC curves, which are drawn by ranging a constant shift over the predictions. 
We introduce an algorithm for plotting them and determine some of its properties in terms of segment slopes and 
convexity. The area over the RROC curve (AOC) is also introduced and analysed. Section 5 discusses RROC curves 
with non-constant shifts and soft regression models, and the relation with cost plots. Finally, section 6 closes the paper 
with an enumeration of issues for future investigation. 
2 Context-sensitive problems



In this section we introduce some notation and the basic concepts about context-sensitive regression and the need for
asymmetric loss functions.

2.1 Notation 

Let us consider a multivariate input domain $X \subset \mathbb{R}^d$ and a univariate output domain $Y \subset \mathbb{R}$. The domain space $D$ is
then $X \times Y$. The length of the dataset will usually be denoted by $n$. Examples or instances are just pairs $(x, y) \in D$, and
datasets are subsets of $D$. A crisp regression model $m$ is a function $m : X \to Y$. A soft regression model accompanies
each prediction with a reliability, confidence or, more generally, a conditional probability density function $f(y|x)$ with
$y \in Y$ and $x \in X$. When the regression model is crisp, we just represent the true value by $y$ and the estimated value by
$\hat{y}$. Subindices will be used when referring to more than one example in a dataset.

Vectors (unidimensional arrays) are denoted in boldface and their elements with subindices, e.g., $\mathbf{v} = (v_1, v_2, \ldots, v_n)$.
Operations mixing arrays and scalar values will be allowed, especially in algorithms, as usual in the matrix arithmetic
of many statistical computing languages. For instance, $\mathbf{v} + c$ means that the constant $c$ is added to all the elements in
the vector $\mathbf{v}$. The mean of a vector is denoted by $\mu(\mathbf{v})$ and its standard deviation by $\sigma(\mathbf{v})$, computed over the population, i.e.,
divided by $n$. Given a dataset with $n$ instances $i = 1 \ldots n$, the error vector $\mathbf{e}$ is defined by $e_i = \hat{y}_i - y_i$. The value $\mu(\mathbf{e}^2)$ is
known as the mean squared error (MSE), $\mu(\mathbf{e})$ is known as the mean error (or error bias), $\mu(|\mathbf{e}|)$ is known as the mean
absolute error (MAE) and $\sigma^2(\mathbf{e})$ as the error variance.
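
As a reminder of how these quantities fit together, the population statistics defined above satisfy the standard decomposition of the MSE into squared error bias plus error variance. This is a textbook identity, stated here only to connect the notation, not a result of this section:

```latex
\mu(\mathbf{e}^2) \;=\; \mu(\mathbf{e})^2 \;+\; \sigma^2(\mathbf{e})
```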

2.2 Context-sensitive problems and loss functions 

In context-sensitive learning [10], there are several features which describe a context, such as the data distribution, the 
costs of using some input variables and the loss of the errors over the output variables [54]. In this paper, we focus on 
loss functions over the output, which is the kind of costs which ROC analysis deals with (typically integrated, along 
with the class distribution, within the notion of skew). A loss function is defined as follows: 

Definition 1. A loss function is any function $\ell : Y \times Y \to \mathbb{R}$ which compares elements in the output domain. For
convenience, the first argument will be the estimated value, and the second argument the actual value, so its application
is usually denoted by $\ell(\hat{y}, y)$.

Typical examples of loss functions are the absolute error ($\ell^A$) and the squared error ($\ell^S$), with $\ell^A(\hat{y}, y) = |\hat{y} - y|$ and
$\ell^S(\hat{y}, y) = (\hat{y} - y)^2$. These two loss functions are symmetric, i.e., for every $y$ and $r$ we have that $\ell(y + r, y) = \ell(y - r, y)$.
Two of the most common metrics for evaluating regression, the mean absolute error (MAE) and the mean squared error 
(MSE) are derived from these losses. 

2.3 Asymmetric costs 

Although symmetric loss functions (and derived metrics) are common for the evaluation of regression models,
it is rarely the case that a real problem has a symmetric cost. For instance, the prediction of sales, consumptions, calls,
prices, demands, etc., almost never has a symmetric loss. A retailing company may need to predict how
many items will be sold next week for stock (inventory) management purposes, e.g., in order to calculate how many
items must be ordered to refill the stock. Depending on the kind of product, over-estimating (which increases
stocking costs) usually does not cost the same as under-estimating (an item is exhausted and cannot be sold, or is sold with delays). In
fact, it is also rare to find applications where even an asymmetric cost is invariable. For instance, depending on the
warehouse saturation, the cost (and the asymmetry) may change in a weekly or daily fashion. We wish to remark here
that a specialised model for a fixed given asymmetry is not the solution on many occasions, either. This motivates the
adaptation (or reframing) of models, rather than their re-training for each new asymmetric loss. This is at the core of
ROC analysis.

There has been an extensive amount of work on regression using asymmetric loss functions. In some cases, the loss
function is embedded in the learning algorithm (see, e.g., [8, 33]), which is useful if we know the operating condition
during training. However, the adaptation (or reframing) of an existing model to a different operating condition has
also been investigated for regression (e.g., Granger [24, 25]). Many different kinds of asymmetric functions have been
explored: Lin-Lin (asymmetric linear), Quad-Quad (asymmetric quadratic), Lin-Exp (approximately linear on one side
and exponential on the other side) and Quad-Exp (approximately quadratic on one side and exponential on the other
side) [55, 6, 7, 2, 50]. Some of these approaches try to adapt to the operating condition using complex (generally
non-parametric) density functions, which is problematic in general. There are many other approaches. We just mention
some of them as an illustration of how important it is in practice to adjust regression models to work with
a specific loss function.

As mentioned above, there are many possible asymmetric loss functions. The simplest (and perhaps most common)
one is the asymmetric absolute error:

Definition 2. The asymmetric absolute error $\ell^A_\alpha$ is a loss function defined as follows:

$\ell^A_\alpha(\hat{y}, y) = 2\alpha (y - \hat{y})$ if $\hat{y} < y$
$\ell^A_\alpha(\hat{y}, y) = 2(1 - \alpha)(\hat{y} - y)$ otherwise

with $\alpha$ being the cost proportion (or asymmetry) between 0 and 1, with increasing values meaning higher cost for
low predictions (under-estimation). In other words, when $\alpha = 0$ we mean that predictions below the actual value have
no cost. When $\alpha = 1$ we mean that predictions above the actual value have no cost. When $\alpha = 0.5$ we mean that costs
above and below are symmetric.
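
Definition 2 translates almost literally into code. The following is a minimal sketch; the function name `asymmetric_absolute_error` and its argument names are illustrative choices, not part of the paper:

```python
def asymmetric_absolute_error(y_hat, y, alpha):
    """Asymmetric absolute error of Definition 2 for a single prediction.

    y_hat: estimated value, y: actual value, alpha: asymmetry in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if y_hat < y:
        # Under-estimation: weighted by 2 * alpha
        return 2 * alpha * (y - y_hat)
    # Over-estimation (or exact prediction): weighted by 2 * (1 - alpha)
    return 2 * (1 - alpha) * (y_hat - y)
```

With `alpha = 0.5` both branches reduce to the plain absolute error, which is why the factor 2 appears in the definition.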



3 The RROC space 

For every regression model deployed to a new dataset we can determine the error for each example and whether it is 
an over-estimation or under-estimation. More formally: 

Definition 3. The total over-estimation is given by $OVER = \sum_i \{e_i \mid e_i > 0\}$ and the total under-estimation is given by
$UNDER = \sum_i \{e_i \mid e_i < 0\}$.

The following example illustrates this:

Example 1. Consider a regression model $m_1$ which is applied to a dataset with $n = 10$ examples $e_1 \ldots e_{10}$, issuing the
predicted values $\hat{y}$ and actual values $y$:

Example                    1       2       3       4       5       6       7       8       9      10
Predicted ($\hat{y}$)   -0.082   3.323   2.320   1.080   7.893   4.983   5.121   3.442   2.083   1.112
Actual ($y$)             0.211   2.725   1.933   3.242   7.858   6.061   7.173   3.082   0.894   1.203
Error ($e$)             -0.293   0.598   0.387  -2.162   0.035  -1.078  -2.052   0.360   1.189  -0.091



The error row ($e$) shows the difference, which is positive for over-estimations and negative for under-estimations.
The sum of over-estimations (OVER) is 2.569 while the sum of under-estimations (UNDER) is -5.676. This regression
model clearly under-estimates (it has a negative error bias, since $\mu(\mathbf{e}) < 0$). The MAE (0.825) and the MSE (1.219)
do not show the asymmetry of predictions.
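
The figures quoted in Example 1 can be reproduced with a few lines of Python. This is only a sketch: the helper name `over_under` is an assumption made for illustration, while the data are the predicted and actual values of the example.

```python
# Predicted and actual values of model m1 in Example 1
y_hat = [-0.082, 3.323, 2.320, 1.080, 7.893, 4.983, 5.121, 3.442, 2.083, 1.112]
y = [0.211, 2.725, 1.933, 3.242, 7.858, 6.061, 7.173, 3.082, 0.894, 1.203]


def over_under(y_hat, y):
    """Return (OVER, UNDER): the sums of positive and of negative errors."""
    errors = [p - a for p, a in zip(y_hat, y)]
    over = sum(e for e in errors if e > 0)
    under = sum(e for e in errors if e < 0)
    return over, under


over, under = over_under(y_hat, y)
n = len(y)
mae = (over - under) / n            # UNDER is negative, so this sums magnitudes
bias = (over + under) / n           # mean error
mse = sum((p - a) ** 2 for p, a in zip(y_hat, y)) / n
print(over, under, mae, bias, mse)
```

Note that MAE and the error bias are both functions of the pair (OVER, UNDER) alone, whereas the MSE needs the individual errors.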



3.1 Showing models in RROC space 

Certainly, different regression models would show different error asymmetries (or error bias). The basic idea of the 
ROC space for regression is to show this asymmetry: 

Definition 4. The Regression Receiver Operating Characteristic (RROC) space is defined as a plot where we depict 
total over-estimation (OVER) on the x-axis and total under-estimation (UNDER) on the y-axis. Since OVER is always 
positive (but unbounded) and UNDER is always negative (but unbounded), we typically will place the point (0,0) on 
the upper left corner (the RROC heaven), and will clip both the x-axis and y-axis as necessary to show the region of 
interest. 

Figure 1 shows the RROC space and the regression model $m_1$ in example 1. We will occasionally draw a diagonal
line $OVER + UNDER = 0$ to show the points where the under-estimation equals the over-estimation.

Figure 1: RROC space and the representation for regression models $m_1$ (in red), $m_2$ (in blue) and $m_3$ (in green) in
examples 1, 2 and 3. The diagonal (dashed) shows where UNDER and OVER are equal. Model $m_2$ has zero error bias
($\mu(\mathbf{e}) = 0$).

One might ask why we use absolute values for the x-axis and y-axis instead of relative values. In fact, ROC
analysis uses relative values. There are two reasons for our choice. First, using relative values would not make the RROC
space finite. Second, and more importantly, using relative values we could have cases where an infinitesimal
change in a single example could end up at very different locations. For instance, consider the error vectors
$\mathbf{e}_A = \{-10, -0.1, 5\}$ and $\mathbf{e}_B = \{-10, 0.1, 5\}$. While UNDER and OVER are almost the same, the relative UNDER and
OVER would be $(-5.05, 5)$ and $(-10, 2.55)$ for two almost equal error vectors. This justifies that the RROC space
shows absolute values. In this sense, and strictly speaking, the parallel with ROC analysis for classification can be
drawn with the 'coverage curves' [20], which are the absolute variant of ROC curves.
Let us now consider a second model: 

Example 2. Consider a regression model $m_2$ which is applied to the same dataset as example 1:

Example                    1       2       3       4       5       6       7       8       9      10
Predicted ($\hat{y}$)    0.786   2.078   0.587   1.676   9.052   5.875   6.885   3.038   4.097   0.308
Actual ($y$)             0.211   2.725   1.933   3.242   7.858   6.061   7.173   3.082   0.894   1.203
Error ($e$)              0.575  -0.647  -1.346  -1.566   1.194  -0.186  -0.288  -0.044   3.203  -0.895



The sum of over-estimations (OVER) is 4.972 while the sum of under-estimations (UNDER) is -4.972. This
regression model finds an equilibrium between over- and under-estimations (it is unbiased, since $\mu(\mathbf{e}) = 0$). The MAE (0.9944)
and the MSE (1.7619) are worse than those of $m_1$ in example 1.

This model ($m_2$), with $OVER + UNDER = 0$, is also shown in Figure 1. Clearly it is on the diagonal.
Finally let us consider a third model: 

Example 3. Consider a regression model $m_3$ as follows:

Example                    1       2       3       4       5       6       7       8       9      10
Predicted ($\hat{y}$)    1.253   4.232   1.734   5.325   6.842   9.325   8.232   3.525   1.352   1.778
Actual ($y$)             0.211   2.725   1.933   3.242   7.858   6.061   7.173   3.082   0.894   1.203
Error ($e$)              1.042   1.507  -0.199   2.083  -1.016   3.264   1.059   0.443   0.458   0.575



In this case, the sum of over-estimations (OVER) is 10.431 while the sum of under-estimations (UNDER) is -1.215.
This regression model clearly over-estimates (it has a positive error bias, since $\mu(\mathbf{e}) > 0$). The MAE (1.165) and the
MSE (2.12) show that this model is, in terms of overall error, worse than models $m_1$ and $m_2$.
From each point in RROC space, we can derive its MAE very easily. For model $m_3$, for example, we have that
$MAE = 1.165 = (OVER - UNDER)/n$, so the total absolute error is just half the perimeter of the rectangle that each point creates with
the RROC heaven (0,0). In other words, the MAE (more precisely, the total absolute error) is just the Manhattan distance
to RROC heaven. It is important to note that the diagonal (the Euclidean distance) is given by $\sqrt{OVER^2 + UNDER^2}$,
which we call MMSE (as a macro-averaged version of MSE). This MMSE measure is interesting in itself,
because it highly penalises models for which there is a high imbalance between over- and under-estimations, and can be seen,
in some way, as a measure of 'symmetric calibration' [3].
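
As a worked illustration of both distances, using only the OVER and UNDER values reported in examples 2 and 3 (the MMSE values are computed here and rounded to two decimals):

```latex
\begin{aligned}
m_3:&\quad OVER - UNDER = 10.431 + 1.215 = 11.646, \qquad \sqrt{10.431^2 + 1.215^2} \approx 10.50 \\
m_2:&\quad OVER - UNDER = 4.972 + 4.972 = 9.944, \qquad\;\; \sqrt{4.972^2 + 4.972^2} \approx 7.03
\end{aligned}
```

Relative to its Manhattan distance, the imbalanced model $m_3$ keeps about 90% of it as Euclidean distance, whereas the balanced model $m_2$ keeps about 71%, which is the minimum possible ratio ($1/\sqrt{2}$).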

In RROC space we denote the regression model always outputting $\infty$ and the model always outputting $-\infty$ as the
(trivial) extreme regression models, which fall at $(\infty, 0)$ and $(0, -\infty)$ respectively in RROC space.

3.2 RROC space isometrics 

We have mentioned above that (1/2 of) the perimeter of the rectangle from RROC heaven to the regression model
corresponds to MAE. Can we extend this observation to the asymmetric loss? The following straightforward lemma shows
that total asymmetric absolute loss can be calculated graphically as the sum of the distance to the y-axis (OVER = 0)
and to the x-axis (UNDER = 0), using the appropriate asymmetry factor $\alpha$.

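Lemma 1. The total asymmetric absolute loss is given by:

$L = \sum_i \ell^A_\alpha(\hat{y}_i, y_i) = -2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER$.

Proof.

$L = \sum_i \ell^A_\alpha(\hat{y}_i, y_i) = \sum_i \{ 2\alpha(y_i - \hat{y}_i) \text{ if } \hat{y}_i < y_i,\; 2(1 - \alpha)(\hat{y}_i - y_i) \text{ otherwise} \}$

$= \sum_i \{ 2\alpha(-e_i) \mid e_i < 0 \} + \sum_i \{ 2(1 - \alpha)(e_i) \mid e_i > 0 \}$

$= -2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER$

□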
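
Lemma 1 can be checked numerically. The sketch below, with illustrative function names, sums the per-example asymmetric losses for the errors of model $m_1$ in example 1 and compares the result with the closed form based on OVER and UNDER:

```python
# Errors of model m1 (last row of the table in Example 1)
errors = [-0.293, 0.598, 0.387, -2.162, 0.035, -1.078, -2.052, 0.360, 1.189, -0.091]


def loss_per_example(errors, alpha):
    """Total asymmetric absolute loss, summed example by example."""
    return sum(2 * alpha * (-e) if e < 0 else 2 * (1 - alpha) * e for e in errors)


def loss_from_point(over, under, alpha):
    """Total asymmetric absolute loss from the RROC coordinates (Lemma 1)."""
    return -2 * alpha * under + 2 * (1 - alpha) * over


over = sum(e for e in errors if e > 0)
under = sum(e for e in errors if e < 0)

for alpha in (0.0, 0.25, 0.5, 0.8, 1.0):
    # Both ways of computing the loss agree up to rounding error
    print(alpha, loss_per_example(errors, alpha), loss_from_point(over, under, alpha))
```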

Clearly, for $\alpha = 0.5$, we have that this is the absolute error. All this also shows that the closer we are to RROC
heaven (0,0) (in terms of a Manhattan distance) the better. Finally, this leads to loss isometrics:

Definition 5. RROC isometrics are defined by varying $t$ over:

$-2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER = t$

We can get any of the infinite (and parallel) isometrics. The following proposition just gets the slope of each 
isometric: 

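Proposition 2. Given an isometric $-2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER = t$, the slope only depends on $\alpha$ and is given
by:

$slope = \frac{1 - \alpha}{\alpha}$

Proof. By isolating the variable UNDER we have:

$UNDER = \frac{t - 2(1 - \alpha) \cdot OVER}{-2\alpha} = -\frac{t}{2\alpha} + \frac{1 - \alpha}{\alpha} \cdot OVER$

The slope is then given by the coefficient of OVER in the second term, $\frac{1 - \alpha}{\alpha}$. □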

Clearly, for $\alpha = 0$ (under-estimations have no cost) we have infinite slope. For $\alpha = 1$ (over-estimations have
no cost), we would have a slope of 0.

This notion of isometric is very similar to the notion already present in ROC analysis for classification [19]. In 
fact, this means that we can slide isometrics to find optimal points in RROC space, in the very same way as we do in 
ROC space. 
Figure 2: The three models as in Figure 1. We show the first isometric line (light grey) corresponding to $\alpha = 0.8$
(slope = 0.25) touching any of the three models.

Let us illustrate this. Figure 2 shows the RROC space and the regression models $m_1$, $m_2$ and $m_3$ in examples 1,
2 and 3 respectively. We also consider the operating condition $\alpha = 0.8$, meaning that under-estimations are 4 times
more expensive than over-estimations. This $\alpha$ leads to a slope of 0.25. By sliding through all the parallel isometric
lines from the one crossing the RROC heaven (0,0) to the first isometric touching a point corresponding to any model,
we touch at (10.431, -1.215) first. In fact, the intercept is given by isolating it from the line equation $UNDER =
slope \cdot OVER + intercept$, i.e., $intercept = UNDER - slope \cdot OVER$, which, in this case, leads to -3.82275. The line
$UNDER = 0.25 \cdot OVER - 3.82275$ is then shown in Figure 2, touching regression model $m_3$. Even though model $m_3$
has a worse mean (symmetric) absolute error than $m_1$, for this operating condition $\alpha$ it leads to a lower total asymmetric
absolute error. While $m_1$ has a loss of $-2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER = -1.6 \cdot (-5.676) + 0.4 \cdot 2.569 = 10.1092$,
we have that $m_3$ has a loss of $-2\alpha \cdot UNDER + 2(1 - \alpha) \cdot OVER = -1.6 \cdot (-1.215) + 0.4 \cdot 10.431 = 6.1164$.
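
Sliding an isometric until it touches the first model is equivalent to picking the model with the lowest total asymmetric loss for the given $\alpha$. A minimal sketch of this selection follows; the dictionary `points` holds the (OVER, UNDER) pairs reported in examples 1 to 3, and the function names are illustrative:

```python
# (OVER, UNDER) coordinates of the three models in RROC space
points = {"m1": (2.569, -5.676), "m2": (4.972, -4.972), "m3": (10.431, -1.215)}


def isometric_slope(alpha):
    """Slope of every isometric for asymmetry alpha (Proposition 2); needs alpha > 0."""
    return (1 - alpha) / alpha


def total_loss(over, under, alpha):
    """Total asymmetric absolute loss of a point in RROC space (Lemma 1)."""
    return -2 * alpha * under + 2 * (1 - alpha) * over


def isometric_intercept(over, under, alpha):
    """Intercept of the isometric passing through the point (over, under)."""
    return under - isometric_slope(alpha) * over


def best_model(points, alpha):
    """Name of the model first touched by the sliding isometric."""
    return min(points, key=lambda name: total_loss(*points[name], alpha))


alpha = 0.8
winner = best_model(points, alpha)
print(winner, isometric_intercept(*points[winner], alpha))
```

Since $t = -2\alpha \cdot intercept$, the lowest loss corresponds to the highest intercept, i.e., the isometric closest to RROC heaven.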

3.3 Hybrid models, dominance and convex hull 

Another construction that is also originally present in ROC analysis for classification is the notion of hybrid models.
Given any two models, we can construct a hybrid model by randomly choosing each prediction from one of the two
models using a (biased) coin. Note that this is very different from averaging both models.

Figure 3 shows the isometric (in light grey) passing through models $m_1$ and $m_3$. The solid black segment connecting
both models shows that any model along the segment can be constructed. More precisely, each point in that segment 
would represent the expected value of a model constructed in this way. Consequently, we can just connect both points 
since any point in between is technically achievable (at least in expectation). 

In this particular case, we just draw a line between the point representing $m_1$, (2.569, -5.676), and the point
representing $m_3$, (10.431, -1.215), leading to $UNDER = 0.567 \cdot OVER - 7.134$. From this slope of 0.567, we just
calculate $\alpha = \frac{1}{1 + slope} = 0.638$. Obviously, for this $\alpha$ both models have the same loss. Omitting the constant
factor 2 of Lemma 1, $L(m_1) = 0.638 \cdot 5.676 + (1 - 0.638) \cdot 2.569 = 4.551$ and $L(m_3) = 0.638 \cdot 1.215 + (1 - 0.638) \cdot 10.431 = 4.551$.

Given these two models, we say that, for slopes lower than 0.567 and asymmetries $\alpha$ greater than 0.638, model
$m_3$ dominates, while model $m_1$ dominates for the rest of operating conditions.
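
The crossover asymmetry between two models, and the expected position of a hybrid of them, follow directly from their RROC coordinates. The sketch below uses illustrative function names; `weight` is the (assumed) probability of taking a prediction from the first model:

```python
m1 = (2.569, -5.676)    # (OVER, UNDER) of model m1
m3 = (10.431, -1.215)   # (OVER, UNDER) of model m3


def crossover(p, q):
    """Slope of the segment joining p and q, and the alpha at which both have equal loss."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    alpha = 1.0 / (1.0 + slope)
    return slope, alpha


def hybrid_point(p, q, weight):
    """Expected (OVER, UNDER) when each prediction comes from p with probability weight."""
    return (weight * p[0] + (1 - weight) * q[0],
            weight * p[1] + (1 - weight) * q[1])


slope, alpha = crossover(m1, m3)
print(round(slope, 3), round(alpha, 3))
print(hybrid_point(m1, m3, 0.5))
```

The hybrid point is linear in `weight` because the expected positive (or negative) error of each example is a weighted average of the two models' contributions.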

Figure 3: The three models as in Figure 1. By considering any model which can be constructed by just choosing
predictions randomly (with any bias) between models $m_1$ and $m_3$, we can show a segment of models (in solid black).

This leads to the notion of dominance and convex hull. In fact, when connecting all the points by the segments
representing the hybrid models (and also including the extreme models at $(0, -\infty)$ and $(\infty, 0)$), we can calculate the
convex hull, since any model under the convex hull can be discarded, in the same way as traditional ROC analysis
does. Figure 4 shows the convex hull of the three models and the extreme models. We see that model $m_2$ can be
discarded. It cannot be optimal for any operating condition.
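
One possible way to compute which models survive on the convex hull is a monotone-chain scan over the points sorted by OVER. This is a sketch under stated assumptions, not the paper's procedure: the function name is illustrative, and the extreme models at $(0, -\infty)$ and $(\infty, 0)$ are handled implicitly by keeping only points that improve UNDER as OVER grows.

```python
def rroc_convex_hull(points):
    """Names of the models on the RROC convex hull, ordered by increasing OVER.

    points maps a model name to its (OVER, UNDER) pair.
    """
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    hull = []
    for name, (x, y) in ordered:
        if hull and y <= hull[-1][1][1]:
            continue  # more OVER without less under-estimation: dominated outright
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2][1], hull[-1][1]
            # Drop the middle point if it lies on or below the chord to the new point
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((name, (x, y)))
    return [name for name, _ in hull]


points = {"m1": (2.569, -5.676), "m2": (4.972, -4.972), "m3": (10.431, -1.215)}
print(rroc_convex_hull(points))
```

For the three models of the examples, the scan removes $m_2$, which lies below the segment joining $m_1$ and $m_3$.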



4 RROC curves 

In ROC analysis for classification, we can tweak the predictions of a crisp classifier by changing the predicted class 
to a random percentage of examples. With this, we can move the classifier in the ROC space, but this just moves 
the classifier along the two straight lines that connect the original point with the points at (0,0) and (1,1) (the trivial, 
or extreme, classifiers). For this reason, a crisp classifier is occasionally represented in ROC space as a trapezium (see footnote 1),
connecting the point which corresponds to the classifier with the extreme classifiers. This two-segment 'curve' does 
not bring more information than the original point, but shows that other TPR and FPR can be achieved by this random 
swapping of examples. In the end, it just shows the hybrid classifier constructed with the extreme classifiers. 

In general, however, in ROC analysis, curves are constructed by the use of soft classifiers, i.e., classifiers which 
output a rank, score or probability estimation. By moving a threshold from the lowest possible value to the highest
possible value (or vice versa) we get many possible crisp classifiers, each of them represented by a point in ROC space. 

Interestingly, in RROC space, we do not need soft regression models in order to create a curve. It is just sufficient 
to use a shift, which works as a parallel concept to the notion of threshold. For each example we can get a modified 
prediction as $\hat{y} \leftarrow \hat{y} + s$, where $s$ is the shift. Although there are, as we will see, many ways of determining this shift,
it seems natural to consider first that s is constant, i.e., that we apply the same value for all the examples. 

Definition 6. Given a regression model $m$, a (constant-)shifted regression model, denoted by $m(s)$, is the result of
adding the same shift $s$ to all its predictions, i.e., $\hat{y}' \leftarrow \hat{y} + s$ for all predictions $\hat{y}$.

This shift can be moved from the lowest possible value ($-\infty$) to the maximum possible value ($\infty$). This leads to
the notion of RROC curve.

Definition 7. Given a regression model $m$, its RROC curve using a (constant) shift is given by plotting all the models
$m(s)$ with $s$ ranging in $[-\infty, \infty]$.



1 A trapezoid in American English. 
Figure 4: The three models as in Figure 1. By considering any model which can be constructed by just choosing
predictions randomly (with any bias) between any other two models, including the extreme models at $(0, -\infty)$ and $(\infty, 0)$,
we can derive the convex hull (shown in solid black).



We can instantly plot the curves pointwise, by just using a sufficiently dense range of values for $s$. However, there is
a more direct way of plotting and analysing the RROC curve if we investigate a little further. This is what we do next.
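
The pointwise approach can be sketched as follows. Adding a shift $s$ to every prediction adds $s$ to every error, so the curve only needs the error vector; the function names and the grid bounds are illustrative assumptions:

```python
def rroc_point(errors, shift=0.0):
    """(OVER, UNDER) of the shifted model m(shift), given the errors of m."""
    shifted = [e + shift for e in errors]
    over = sum(t for t in shifted if t > 0)
    under = sum(t for t in shifted if t < 0)
    return over, under


def sample_rroc_curve(errors, lo, hi, steps=200):
    """Pointwise approximation of the RROC curve on a regular grid of shifts."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return [rroc_point(errors, s) for s in grid]


# Errors of model m1 in Example 1
errors_m1 = [-0.293, 0.598, 0.387, -2.162, 0.035, -1.078, -2.052, 0.360, 1.189, -0.091]

# The shift that removes the error bias places the model on the diagonal
zero_bias_shift = -sum(errors_m1) / len(errors_m1)
print(rroc_point(errors_m1), rroc_point(errors_m1, zero_bias_shift))
```

A dense grid gives a good picture of the curve, but it may miss the exact positions where the slope changes.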

4.1 Algorithm for drawing RROC curves 

We can realise that if we move the shift from $s_1$ to $s_2$ and no example changes from OVER to UNDER or vice versa,
then the increment/decrement in OVER and UNDER is linear, as the following proposition shows:

Proposition 3. Given a model $m$, for any two shifts $s_1$ and $s_2$ such that the examples for which $m(s_1)$ and $m(s_2)$
over-estimate are the same (and hence the rest, which under-estimate, are also the same for both), then for any other shift
$s_3$ with $s_1 < s_3 < s_2$ we have that the points (OVER, UNDER) for the three models $m(s_1)$, $m(s_2)$ and $m(s_3)$ lie on the
same straight line.

Proof. We have that OVER for $m(s_1)$ is calculated as $OVER_1 = \sum_i \{e_i + s_1 \mid e_i + s_1 > 0\}$, while OVER for $m(s_2)$ is
calculated as $OVER_2 = \sum_i \{e_i + s_2 \mid e_i + s_2 > 0\}$. Since, by assumption, the examples which over-estimate are the
same for $m(s_1)$ and $m(s_2)$, let us call their number $n_o$. The previous two expressions can then be rewritten as:

$OVER_1 = n_o s_1 + \sum_i \{e_i \mid e_i + s_1 > 0\}$

$OVER_2 = n_o s_2 + \sum_i \{e_i \mid e_i + s_1 > 0\}$

Note that the second term is also written with $s_1$, since the elements are the same. In this way, we express that the
second term is equal. Also, since the examples which over-estimate are the same for $s_1$ and $s_2$, they necessarily have to be the same
for every $s_3$ with $s_1 < s_3 < s_2$ as well. So, we also have:

$OVER_3 = n_o s_3 + \sum_i \{e_i \mid e_i + s_1 > 0\}$

We can see that these three co-ordinates only differ in the first term, which is linearly related to $s$ ($s_1$, $s_2$ or $s_3$). We
can obtain similar expressions for $UNDER_1$, $UNDER_2$ and $UNDER_3$ and their $n_u$ examples. This means that the three
points are related by a linear term on $s$, expressed as $(n_o s, n_u s)$, so they lie on the same line. □
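
The proof can be summarised in one line. Writing $c_o$ and $c_u$ for the shift-independent sums (names introduced here only for illustration), within a range of shifts where no example changes side we have:

```latex
OVER(s) = n_o\, s + c_o, \qquad UNDER(s) = n_u\, s + c_u
\quad\Longrightarrow\quad
\frac{\Delta\, UNDER}{\Delta\, OVER} = \frac{n_u}{n_o} \quad (n_o > 0)
```

so the slope of the segment is the ratio between the number of under-estimated and over-estimated examples, and it is infinite when $n_o = 0$.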






From proposition 3 we can introduce a very simple algorithm to draw RROC curves: 



Algorithm: PlotRROCCurve

input : Two arrays ŷ and y of size n with the predicted and the actual values respectively.
output: The n + 2 vertex points of the RROC curve in arrays RROCX and RROCY.

    // Draws the curve from bottom-left corner to top-right corner
    e ← ŷ - y                                  // The error vector
    e ← SortDecreasingly(e)
    RROCX[1] ← 0
    RROCY[1] ← -∞
    for i ← 1 to n do
        s ← e[i]                               // The shift s as examples change from OVER to UNDER
        t ← e - s                              // Applies a constant shift s to the array e
        RROCX[i+1] ← Σ_j {t[j] | t[j] > 0}     // OVER
        RROCY[i+1] ← Σ_j {t[j] | t[j] < 0}     // UNDER
    end
    RROCX[n+2] ← ∞
    RROCY[n+2] ← 0

Algorithm 1: Algorithm for drawing a RROC curve. We use brackets for array notation and array operations. The
algorithm can be further simplified by updating the array e and calculating OVER and UNDER incrementally in
each iteration in the loop.
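
A direct Python transcription of Algorithm 1 might look as follows. It is a sketch rather than a reference implementation: the function name `rroc_curve` is an assumption, plain lists replace the array arithmetic, and the non-incremental form of the loop is kept for clarity (it is quadratic in n).

```python
import math


def rroc_curve(y_hat, y):
    """Return the n + 2 vertex points of the RROC curve as two lists (xs, ys)."""
    # Error vector, sorted decreasingly so the curve runs bottom-left to top-right
    errors = sorted((p - a for p, a in zip(y_hat, y)), reverse=True)
    xs, ys = [0.0], [-math.inf]          # extreme model that always under-estimates
    for s in errors:
        shifted = [e - s for e in errors]            # constant shift applied to all errors
        xs.append(sum(t for t in shifted if t > 0))  # OVER
        ys.append(sum(t for t in shifted if t < 0))  # UNDER
    xs.append(math.inf)                  # extreme model that always over-estimates
    ys.append(0.0)
    return xs, ys


# Model m1 of Example 1
y_hat = [-0.082, 3.323, 2.320, 1.080, 7.893, 4.983, 5.121, 3.442, 2.083, 1.112]
y = [0.211, 2.725, 1.933, 3.242, 7.858, 6.061, 7.173, 3.082, 0.894, 1.203]
xs, ys = rroc_curve(y_hat, y)
print(list(zip(xs, ys)))
```

The first finite vertex has OVER = 0 (the shift cancels the largest error) and the last finite vertex has UNDER = 0 (the shift cancels the smallest error).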



From the first line of the algorithm, we see that the RROC Curve can be drawn by just giving the error vector (e.g., 
the last row in examples 1, 2 and 3). 

Figure 5 shows a RROC curve using this algorithm for m\ in example 1. The points where the slope of the RROC 
Curve change are called vertex points, and the rest of points are said to fall onto the segments. Consequently a RROC 
Curve for a regression model applied to a dataset with n instances has n + 2 vertex points (typically, only n are visible 
on the plot, because two are the extreme points) and n + 1 segments, denoted by i, i + 1 with i = 1 . . . n + 1. We clearly 
see n = 10 points on Figure 5. 

In case there are some ties in the error vector, some of these vertex points and segments collapse into a single
point. Figure 6 shows this for the model in the following example.

Example 4. Consider a regression model $m_4$ as follows:

Example                    1       2       3       4       5       6       7       8       9      10
Predicted ($\hat{y}$)    0.123   1.221   1.845   4.573   8.558   7.392   5.669   1.578   0.806   1.245
Actual ($y$)             0.211   2.725   1.933   3.242   7.858   6.061   7.173   3.082   0.894   1.203
Error ($e$)             -0.088  -1.504  -0.088   1.331   0.700   1.331  -1.504  -1.504  -0.088   0.042



We see a triple tie between examples 1, 3 and 9, another triple tie between examples 2, 7 and 8, and a double tie 
between examples 4 and 6. With this, there are only 5 different error values. 
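
When ties are detected programmatically, floating-point subtraction can make errors that are equal on paper differ in their last bits. The sketch below rounds the errors to the three decimals used in the example before counting them; the rounding step is a practical assumption, not something prescribed by the paper:

```python
from collections import Counter

# Predicted and actual values of the model in Example 4
y_hat = [0.123, 1.221, 1.845, 4.573, 8.558, 7.392, 5.669, 1.578, 0.806, 1.245]
y = [0.211, 2.725, 1.933, 3.242, 7.858, 6.061, 7.173, 3.082, 0.894, 1.203]

# Round to the precision of the data so that tied errors compare equal
errors = [round(p - a, 3) for p, a in zip(y_hat, y)]
multiplicity = Counter(errors)
distinct = sorted(multiplicity, reverse=True)

print(distinct)        # one visible vertex point per distinct error value
print(multiplicity)    # how many examples collapse into each vertex point
```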



4.2 Properties: slope and convexity 

From the new RROC curve, we may want to determine the slopes of each segment, in order to exactly determine where 
each possible isometric (and asymmetry a) would lead to on the curve. This can be done very easily, as the following 
lemma shows: 

Lemma 4. The slope of each segment $i, i+1$ in the RROC curve is given by $(n + 1 - i)/(i - 1)$, with $i = 1 \ldots n + 1$.

Proof. Let us assume no ties in the error vector. As shown in proposition 3, there is one example changing from 
UNDER to OVER (from bottom-left to top-right) at each vertex point. At the first vertex point i = 1, all the examples 
are under-estimated, and the shift change moves along an infinite slope. For the next vertex point i = 2, we have 
n — 1 under-estimated examples and 1 over-estimated example. This means that the shift change moves along one unit 



11 



o 
I 




Figure 5: Model $m_1$ in example 1 drawn as a RROC curve by changing the shift. Vertex points (10 in this case, since 
the two extremes are not visible in the plot) are shown as small circles. The curve is then composed of 11 segments 
(there are 10 examples). The original shift (s = 0) is still represented with a small square and lies on a segment between 
two vertex points. 

Figure 6: Model in example 4 drawn as a RROC curve by changing the shift. The model has several errors with 
the same value (two triple ties and a double tie), so the number of distinct visible points is reduced from n = 10 to 
n − 2 − 2 − 1 = 5. 

Figure 7: The three models $m_1$ (red), $m_2$ (blue) and $m_3$ (green) in examples 1, 2 and 3 drawn as RROC curves by 
changing the shift. Note that in this case model $m_3$ cannot be rejected, because there are regions where it is optimal. 
If we select the best portions from the three models we see concavities, which can be resolved by the use of a convex 
hull. 



right and n − 1 units up, with a slope of n − 1. By induction, this leads to $(n + 1 - i)/(i - 1)$, with the last segment 
having slope 0. If there are ties, the result is similar, with more than one example changing from under-estimation to 
over-estimation at a time. □ 
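
The lemma can be checked numerically. The sketch below is only an illustration under stated assumptions: the error vector is hypothetical (chosen without ties), the helper name `rroc_vertices` is ours, and each finite vertex is obtained with the shift that zeroes one of the sorted errors.

```python
from fractions import Fraction

def rroc_vertices(errors):
    """Finite vertex points (OVER, UNDER) of the RROC curve, one per example.

    The k-th vertex uses the shift s = -e_k that zeroes the k-th largest error;
    OVER sums the positive shifted errors and UNDER the negative ones.
    """
    ordered = sorted(errors, reverse=True)
    points = []
    for pivot in ordered:
        shifted = [e - pivot for e in ordered]
        over = sum(x for x in shifted if x > 0)
        under = sum(x for x in shifted if x < 0)
        points.append((over, under))
    return points

# Hypothetical error vector without ties (exact fractions avoid rounding noise).
errors = [Fraction(2), Fraction(1, 2), Fraction(-1, 4), Fraction(-1), Fraction(-3)]
n = len(errors)
points = rroc_vertices(errors)

# In the lemma's numbering, segment 1 is vertical and segment n + 1 is
# horizontal; the finite segment i (i = 2..n) joins finite vertices i - 1 and i.
for i in range(2, n + 1):
    (o1, u1), (o2, u2) = points[i - 2], points[i - 1]
    measured = (u2 - u1) / (o2 - o1)
    predicted = Fraction(n + 1 - i, i - 1)
    print(i, measured, predicted)
```

The measured and predicted slopes coincide for every finite segment, whatever error values are used, as long as there are no ties.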

Thus, and somewhat surprisingly, given a fixed number of examples, several regression models will have exactly 
the same slopes. The difference between the curves will be given by the length of the segments, not their slopes. 
From the equation $1/\alpha - 1 = \text{slope}$ in proposition 2 relating asymmetries and slopes, we have that each segment i 
corresponds to an $\alpha = \frac{1}{\text{slope} + 1}$, leading to $\alpha = \frac{i - 1}{n}$ with $i = 1, \ldots, n + 1$. 
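
For completeness, the substitution can be written out step by step. This is a short worked derivation, assuming only the slope of lemma 4 and the relation between asymmetry and slope, $\alpha = 1/(\text{slope} + 1)$:

```latex
\[
\text{slope}_i + 1 = \frac{n + 1 - i}{i - 1} + 1 = \frac{n}{i - 1}
\quad\Longrightarrow\quad
\alpha_i = \frac{1}{\text{slope}_i + 1} = \frac{i - 1}{n},
\qquad i = 1, \ldots, n + 1 .
\]
```

In particular, the first (vertical) segment corresponds to $\alpha = 0$, the last (horizontal) segment to $\alpha = 1$, and consecutive segments are separated by steps of $1/n$ in asymmetry.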

Finally, from the previous Figure 5, we see that the curve is convex. Is this true in general? The following 
proposition shows it is. 

Proposition 5. For every regression model, the RROC curve is convex (see footnote 2). 

Proof. It is direct from lemma 4, since the sequence of the segment slopes of the curve, $(n + 1 - i)/(i - 1)$, is 
non-increasing. □ 

The convexity of a single RROC curve does not mean that the notion of convex hull seen in the previous section is 
useless for curves. Quite the contrary: whenever we have more than one model, we can see concavities. Figure 7 
shows precisely this. 

From these three curves, we can calculate their convex hull, as shown in Figure 8. 



4.3 Areas and metrics 

RROC analysis, like ROC analysis, can be especially useful for analysing models under different operating conditions 
and selecting the best one for a single operating condition or a region, or even better, for creating hybrids through the 
notion of convex hull. Nonetheless, in ROC analysis we are also interested in evaluating models that can work well for a 
wide range of operating conditions. One measure that gives us a good indication of a classifier performing well in a wide 
range of operating conditions is the Area Under the ROC Curve (AUC). Can we develop a similar measure for RROC 
curves? 

Footnote 2: Note that in ROC analysis we typically say 'convex' when the region below is a convex set, while, generally, 
in mathematics, this refers to the region above being a convex set. 

Figure 8: Convex hull of Figure 7, shown in black. There are 12 visible points (represented as black crosses) on the 
convex hull: 6 from $m_1$ (red), 3 from $m_3$ (green), and 3 from $m_2$ (blue). 

The good mapping so far between ROC curves and RROC curves in terms of what they represent suggests that this 
is possible. The following definition introduces such a measure: 

Definition 8. The Area Over the RROC Curve (AOC) is defined as follows: 

\[ AOC = -\int_{0}^{\infty} \mathrm{UNDER} \; d\,\mathrm{OVER} = \int_{-\infty}^{0} \mathrm{OVER} \; d\,\mathrm{UNDER} \] 

Lower values for AOC are better. 

The previous area can be calculated very easily using the sum of the n + 1 upward trapeziums given between the 
elements 1 and n + 2 from RROCX and RROCY in algorithm 1. Actually, for models always outputting finite values, 
this can be calculated from 2 to n, since the extreme trapezium 1 to 2 has area 0 and the trapezium n + 1 to n + 2 as 
well, so we only need to sum n − 1 trapeziums. Consequently: 

\[ AOC = -\sum_{i=2}^{n} \frac{RROCY_{i+1} + RROCY_i}{2} \, (RROCX_{i+1} - RROCX_i) \] 
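
The trapezium sum can be sketched in a few lines of Python. This is an illustration rather than the procedure of algorithm 1: the helper names and the error vector are hypothetical, and only the n finite vertex points are built (the two extreme trapeziums have area 0).

```python
def rroc_vertices(errors):
    """Finite vertex points (OVER, UNDER), ordered by increasing OVER."""
    ordered = sorted(errors, reverse=True)
    points = []
    for pivot in ordered:
        over = sum(e - pivot for e in ordered if e > pivot)
        under = sum(e - pivot for e in ordered if e < pivot)
        points.append((over, under))
    return points

def area_over_rroc(errors):
    """Sum of the n - 1 trapeziums between consecutive finite vertices."""
    points = rroc_vertices(errors)
    area = 0.0
    for (o1, u1), (o2, u2) in zip(points, points[1:]):
        # UNDER is non-positive, hence the minus sign to get a positive area.
        area -= (u1 + u2) / 2 * (o2 - o1)
    return area

errors = [2.0, 0.5, -0.25, -1.0, -3.0]  # hypothetical error vector
aoc = area_over_rroc(errors)
print(aoc)  # 34.25
```

Ties need no special handling here: tied errors produce repeated vertices, whose trapeziums have zero width.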

The first question about this area is why we have defined the area over the curve and not under the curve. This has 
an easy answer: since the RROC space is unbounded, the area under the curve is always infinite. But what about the 
AOC? The following proposition gives an answer: 

Proposition 6. For any regression model m which always outputs finite values, the AOC is finite. 

Proof. Since the model m always outputs finite values, there is a shift $s_l$ such that for any shift $s < s_l$ we have that 
OVER = 0, and there is also a shift $s_u$ such that for any shift $s > s_u$ we have that UNDER = 0. This means that the 
curve touches (and stays at) both the x-axis and the y-axis. Then the area is finite. □ 

For the three models in Figure 7, the AOC is 56.1387, 88.0933 and 63.9295 for models $m_1$, $m_2$ and $m_3$ respectively. 
Although a single number loses most of the information we can see on the curve, these numbers summarise their overall 
performance. 

Figure 9: We use a dataset with 1,000 examples using a normal distribution with mean at 0 and standard deviation 
0.01. The first model (in violet) is just generated as a random model using the same distribution (MAE = 0.011, 
MSE = 0.00019, AOC = 93.28). As we see, its performance is low. The second model (in orange) is just built by 
using the actual values plus a random normal noise with the same distribution (MAE = 0.0079, MSE = 0.00010, 
AOC = 49.83). Finally, the third model (in brown) is just a model always outputting the same value, the mean of the 
actual data (MAE = 0.0080, MSE = 0.00010, AOC = 50.31). These last two models seem to have very similar metrics 
and curves (they almost overlap completely). 

From the notion of AOC, we can investigate what exactly it means to have low AOC and high AOC. The 'best' 
model in terms of AOC (a perfect square with top-left corner at the RROC heaven (0,0)) means that there is a shift 
that achieves 0 error. This is rarely the case, except for datasets with one single example (where there is always a shift 
getting 0 loss). It is also very rare to have a dataset for which the error is always the same, another possible situation 
where we would have AOC = 0. Note that a model with very high MSE or MAE could, in principle, have AOC = 0. This 
would suggest that the shift was very badly chosen. The parallel with classical ROC analysis here is clear, where we 
can have bad accuracy for a model with optimal AUC by choosing a bad threshold. 

What about the 'worst' model in terms of AOC? Of course we can have a value of AOC as high as we want. We 
can even get an infinite AOC, if the model outputs ∞ or −∞ for one single example. So, the question must be stated 
more precisely: given a model with a certain MAE, what is the worst value for AOC? This is difficult to answer. At 
first sight, it seems that the degree of dispersion of the error may have an effect, since it may make the shift more effective. 
Also, the degree of correlation between the actual and predicted values could be important. Figure 9 shows how a 
random model looks (in violet), which typically shows low performance. Also, it compares two models with similar 
performance, but one which is just generated adding random noise to the true values (in orange) and the other by 
calculating the mean of the true values (in brown). While these two last models have very different dispersion (the 
last model has null dispersion) and very different correlation (the last model has null correlation), their metrics and 
RROC curves are very similar. This is explained because their error distributions are similar. Hence, one possible way 
of looking at RROC curves is precisely this. They represent the distribution of errors. 

A different question is to give a numerical interpretation of the AOC. While its definition suggests that it may be the 
expected value of the total under-estimation given a uniform value for the total over-estimation, this is not well-defined 
because both UNDER and OVER are not bounded. A possible interpretation is that it aggregates the macro-average 
squared error (n · MMSE) with a distribution which depends on the model (see footnote 3), which is similar to one recent 
interpretation given to AUC [26]. Other interpretations as an aggregation of expected loss may be possible (see footnote 4), 
as has happened with AUC recently, where new interpretations have been introduced [22, 32]. 

Footnote 3: Note that the length of each segment may represent the frequency of each possible value of the asymmetry 
parameter $\alpha$. 

Having said all this, our previous idea of the AOC being related to the distribution of errors seems more appealing. 
If we have a compact error distribution, then AOC will be low. If we have a sparse error distribution, then AOC will 
be high. One classical measure of dispersion is precisely the variance, defined and decomposed as follows: 

Definition 9. The error variance $\sigma(e)^2$ is defined as: 

\[ \sigma(e)^2 = \frac{\sum_{i=1}^{n} (e_i - \mu(e))^2}{n} = \frac{\sum_{i=1}^{n} e_i^2}{n} - \mu(e)^2 = MSE - \mu(e)^2 \] 

where $\mu(e)$ represents the mean of the vector e. 

Note that we define the population variance, by dividing by n (instead of n − 1). The reason is just to keep 
the expressions that will follow next as simple as possible. We will use just $\sigma$ (instead of $\sigma(e)$) and $\mu$ (instead of 
$\mu(e)$) when clear from the context. The last term in definition 9 is just a different way of showing the classical MSE 
decomposition as the sum of the squared error bias ($\mu^2$) and the error variance ($\sigma^2$). 
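
As a quick numerical illustration of this decomposition (a sketch on a hypothetical error vector, using the population variance from Python's `statistics` module):

```python
import statistics

errors = [2.0, 0.5, -0.25, -1.0, -3.0]  # hypothetical error vector
n = len(errors)

mse = sum(e * e for e in errors) / n
bias = statistics.mean(errors)            # mu(e)
variance = statistics.pvariance(errors)   # population variance, divides by n

# MSE should equal squared bias plus error variance.
print(mse, bias ** 2 + variance)
```

Using `statistics.variance` instead would divide by n − 1 and break the equality, which is why the population variance is the convenient choice here.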

Quite surprisingly, the observation that the AOC and the error variance are related can be made extremely precise, 
as the following theorem shows: 

Theorem 7. The area over the RROC curve equals the population variance $\sigma^2$ of the errors multiplied by a factor 
$n^2/2$ which is independent of the model. Namely: 

\[ AOC = \frac{\sigma^2 n^2}{2} \] 

Proof. We start with an error vector e of length n, which we assume is sorted in decreasing order, as in algorithm 1. 
We use a different notation for the points in the RROC curve. Instead of using n + 2 points, we will just ignore the 
two extremes (which do not contribute to the area for finite cases) and we will just work with n points, denoted by 
$p_1, \ldots, p_n$. The components of each point are $p_i = (o_i, u_i)$. Note that $o_i = RROCX_{i+1}$ using the notation in algorithm 
1 and $u_i = RROCY_{i+1}$. We will also introduce the error differences $d_i = e_i - e_{i+1}$, which are defined from i = 1 to 
i = n − 1. Note that $d_i \geq 0$ since the error vector e is in decreasing order. It is easy to see that $o_i = \sum_{j=1}^{i-1} j \cdot d_j$ and 
$u_i = -\sum_{j=i}^{n-1} (n - j) \cdot d_j$. According to this notation: 

\[ AOC = -\sum_{i=1}^{n-1} \frac{u_i + u_{i+1}}{2} \, (o_{i+1} - o_i) \] 

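In order to prove this theorem, we will proceed by induction. 

Base case 

The base case will consider any error vector of size n = 2. In this case, we only have two points $p_1 = (0, -d_1)$ and 
$p_2 = (d_1, 0)$. From here, 

\[ 
\begin{aligned} 
AOC &= -\sum_{i=1}^{1} \frac{u_i + u_{i+1}}{2} (o_{i+1} - o_i) = -\frac{-d_1 + 0}{2} (d_1 - 0) = \frac{d_1^2}{2} = \frac{(e_2 - e_1)^2}{2} \\ 
&= \frac{(e_2 - \mu + \mu - e_1)^2}{2} = \frac{(e_2 - \mu)^2 + (\mu - e_1)^2 + 2 (e_2 - \mu)(\mu - e_1)}{2} \\ 
&= \frac{2 (e_2 - \mu)^2 + 2 (\mu - e_1)^2}{2} = \frac{4 \sigma^2}{2} = \frac{\sigma^2 n^2}{2} 
\end{aligned} 
\] 

where the last line uses that, for n = 2, $e_2 - \mu = \mu - e_1$ and $2 \sigma^2 = (e_1 - \mu)^2 + (e_2 - \mu)^2$. 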

Inductive step 

We assume that 

\[ AOC = \frac{\sigma^2 n^2}{2} \qquad (1) \] 

holds for any dataset of size n. 

Footnote 4: We suggest some possible pathways for exploration. Since AOC is related to the magnitude of predictions (and 
errors), it cannot be directly related to rank-based correlation measures in regression. However, it could be related to the 
sum $\sum_{y_1, y_2} (\hat{y}_1 - \hat{y}_2 \mid y_1 > y_2)$, which would work as a counterpart of the Wilcoxon-Mann-Whitney 
interpretation of the AUC. 

Without loss of generality, we consider that the case for n + 1 is constructed by adding example $e_{n+1}$, assumed to 
be lower than the other examples $e_1, e_2, \ldots, e_n$ in the case for n. Consequently, the error vector for the case n + 1 is 
$e_1, e_2, \ldots, e_n, e_{n+1}$. The difference vector is also an extension for n + 1, denoted by $d_1, d_2, \ldots, d_n$. Note that since we 
assume that eq. 1 holds for any dataset of n examples, we can choose the order of examples that we prefer in order to 
build any case with n + 1 examples. 

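The AOC for the n case is given by: 

\[ AOC = -\sum_{i=1}^{n-1} \frac{u_i + u_{i+1}}{2} (o_{i+1} - o_i) \] 

The AOC for the n + 1 case is given by 

\[ \widetilde{AOC} = -\sum_{i=1}^{n} \frac{\tilde{u}_i + \tilde{u}_{i+1}}{2} (\tilde{o}_{i+1} - \tilde{o}_i) \qquad (2) \] 

We will use a wide tilde to denote the AOC, $\sigma$, $\mu$, etc., for the n + 1 case. The first thing we can see is that 
$\tilde{u}_1 = u_1 - d_1 - d_2 - \cdots - d_n$, $\tilde{u}_2 = u_2 - d_2 - \cdots - d_n$, etc. We use these latter expressions on (2), with the 
convention $u_{n+1} = 0$: 

\[ \widetilde{AOC} = -\sum_{i=1}^{n} \frac{u_i + \{-\sum_{j=i}^{n} d_j\} + u_{i+1} + \{-\sum_{j=i+1}^{n} d_j\}}{2} (\tilde{o}_{i+1} - \tilde{o}_i) \] 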

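The second thing we realise is that $o_i$ and $\tilde{o}_i$ are equal for $i = 1, \ldots, n$. From here, we can calculate the delta between 
n + 1 and n as follows: 

\[ \Delta AOC = \widetilde{AOC} - AOC = -\sum_{i=1}^{n} \frac{\{-\sum_{j=i}^{n} d_j\} + \{-\sum_{j=i+1}^{n} d_j\}}{2} (o_{i+1} - o_i) \] 

But we have that $o_{i+1} - o_i = i \cdot d_i$. So, we rewrite: 

\[ \Delta AOC = \sum_{i=1}^{n} \frac{d_i + 2 \sum_{j=i+1}^{n} d_j}{2} \, i \cdot d_i = \sum_{i=1}^{n} \frac{i \cdot d_i^2 + 2 \sum_{j=i+1}^{n} i \cdot d_i d_j}{2} \] 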

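Using the expression of the square of a sum, $(\sum_i a_i)^2 = \sum_i a_i^2 + 2 \sum_{i<j} a_i a_j$, and joining/distributing terms, we see 
that the above expression can be rewritten as: 

\[ 
\begin{aligned} 
\Delta AOC &= \sum_{i=1}^{n} \frac{\left( \sum_{j=i}^{n} d_j \right)^2}{2} = \sum_{i=1}^{n} \frac{(e_{n+1} - e_i)^2}{2} \\ 
&= \frac{1}{2} \left( n \cdot e_{n+1}^2 - 2 e_{n+1} \sum_{i=1}^{n} e_i + \sum_{i=1}^{n} e_i^2 \right) \\ 
&= \frac{1}{2} \left( n \cdot e_{n+1}^2 - 2 e_{n+1} \, n \cdot \mu + n (\sigma^2 + \mu^2) \right) \\ 
&= \frac{n}{2} \left( e_{n+1}^2 - 2 e_{n+1} \mu + \sigma^2 + \mu^2 \right) \\ 
&= \frac{n}{2} \left( (e_{n+1} - \mu)^2 + \sigma^2 \right) 
\end{aligned} 
\] 

From here, we can now write: 

\[ \widetilde{AOC} = AOC + \Delta AOC = AOC + \frac{n}{2} \left( (e_{n+1} - \mu)^2 + \sigma^2 \right) \] 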



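From the induction step (equation 1), we have: 

\[ 
\begin{aligned} 
\widetilde{AOC} &= \frac{\sigma^2 n^2}{2} + \frac{n}{2} \left( (e_{n+1} - \mu)^2 + \sigma^2 \right) \\ 
&= \frac{n}{2} \left( \sigma^2 n + (e_{n+1} - \mu)^2 + \sigma^2 \right) \\ 
&= \frac{n}{2} \left( \sigma^2 (n + 1) + (e_{n+1} - \mu)^2 \right) \\ 
&= \frac{n}{2} \left( \left( \frac{\sum_{i=1}^{n} e_i^2}{n} - \mu^2 \right) (n + 1) + (e_{n+1} - \mu)^2 \right) \\ 
&= \frac{1}{2} \left( \left( \sum_{i=1}^{n} e_i^2 - n \mu^2 \right) (n + 1) + n \cdot (e_{n+1} - \mu)^2 \right) \\ 
&= \frac{1}{2} \left( \left( \sum_{i=1}^{n} e_i^2 \right) (n + 1) - (n \mu^2)(n + 1) + n \cdot e_{n+1}^2 - 2 n \cdot e_{n+1} \mu + n \mu^2 \right) \\ 
&= \frac{1}{2} \left( \left( \sum_{i=1}^{n+1} e_i^2 \right) (n + 1) - (n \mu^2)(n) - e_{n+1}^2 - 2 n \cdot e_{n+1} \mu \right) \\ 
&= \frac{1}{2} \left( \left( \sum_{i=1}^{n+1} e_i^2 \right) (n + 1) - (n \mu + e_{n+1})^2 \right) \\ 
&= \frac{1}{2} \left( \left( \sum_{i=1}^{n+1} e_i^2 \right) (n + 1) - ((n + 1) \tilde{\mu})^2 \right) \\ 
&= \frac{1}{2} (n + 1)^2 \left( \frac{\sum_{i=1}^{n+1} e_i^2}{n + 1} - \tilde{\mu}^2 \right) \\ 
&= \frac{1}{2} (n + 1)^2 \, \tilde{\sigma}^2 \\ 
&= \frac{\tilde{\sigma}^2 (n + 1)^2}{2} 
\end{aligned} 
\] 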

This last expression completes the induction step and so does the proof. □ 
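
The identity is also easy to check numerically. The following sketch is only an illustration: the function name, the random error vector and its parameters are hypothetical. It computes the area by summing trapeziums and compares it with $\sigma^2 n^2 / 2$.

```python
import random
import statistics

def area_over_rroc(errors):
    """Area over the RROC curve, summing trapeziums between finite vertices."""
    ordered = sorted(errors, reverse=True)
    points = []
    for pivot in ordered:
        # Vertex reached with the shift s = -pivot.
        over = sum(e - pivot for e in ordered if e > pivot)
        under = sum(e - pivot for e in ordered if e < pivot)
        points.append((over, under))
    area = 0.0
    for (o1, u1), (o2, u2) in zip(points, points[1:]):
        area -= (u1 + u2) / 2 * (o2 - o1)
    return area

rng = random.Random(0)
errors = [rng.gauss(0.3, 1.5) for _ in range(200)]  # hypothetical biased errors
n = len(errors)

area = area_over_rroc(errors)
closed_form = statistics.pvariance(errors) * n ** 2 / 2
# Adding a constant bias to every error should leave the area unchanged.
shifted_area = area_over_rroc([e + 5.0 for e in errors])
print(area, closed_form, shifted_area)
```

The three printed values agree up to floating-point rounding, which also illustrates that the area does not depend on the error bias.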

Corollary 8. If the model is unbiased (i.e. $\mu(e) = 0$) then: 

\[ AOC = \frac{MSE \cdot n^2}{2} \] 

Proof. The proof is direct from theorem 7 and definition 9 (MSE decomposition). □ 

For the models $m_1$, $m_2$ and $m_3$ in examples 1, 2 and 3 we have a variance of 1.1228, 1.7619 and 1.2786 respectively. 
The AOC is 56.1387, 88.0933 and 63.9295 respectively. Since $m_2$ is unbiased, its MSE is precisely 1.7619, its error 
variance. The constant factor is $n^2/2 = 50$ in the three cases. Similarly, for the third model (in brown) in Figure 9, as 
it always outputs the mean of a distribution with standard deviation 0.01 and 1,000 examples, we have that the AOC 
was 50.31. The expected result is $0.0001 \cdot 1000^2/2 = 50$. The difference arises not because theorem 7 is approximate, 
but because the sample is generated from a distribution with $\sigma^2 = 0.0001$ and does not have exactly 
this variance (it is actually 0.00010062). 
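
The arithmetic behind these figures can be reproduced directly. The snippet below is a sketch that only uses the numbers reported in the text; since the variances are rounded to four decimals, small differences in the last digits are expected.

```python
n = 10
factor = n ** 2 / 2  # 50 for the three example models

# (error variance, AOC) pairs as reported in the text.
reported = {
    "m1": (1.1228, 56.1387),
    "m2": (1.7619, 88.0933),
    "m3": (1.2786, 63.9295),
}
for name, (variance, aoc) in reported.items():
    print(name, round(variance * factor, 4), aoc)

# Constant model of Figure 9: sample variance 0.00010062 with 1,000 examples.
figure9_aoc = 0.00010062 * 1000 ** 2 / 2
print(round(figure9_aoc, 2))
```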

Given the connection between the area over the RROC curve and the population variance, we can explore the 
connection between the RROC curve and an error density plot. As we can see in Figure 10, there is a high 
correspondence between the density plots and the RROC curve, but the cumulative character of the RROC curve makes the 
latter smoother. 

Figure 10: The error density plots of the same three models as in Figure 7. Note that similar error 
density functions will produce similar RROC curves. Here, more peaked density functions mean better RROC curves. 

Note that this connection between AOC and the error variance indicates that it is the dispersion that counts when 
trying to adapt our models to cost-sensitive situations with asymmetries, and not the position, which can be ignored 
by assuming that the optimal shift will be chosen for each particular operating condition. This again shows a parallel 
with ROC analysis. In ROC analysis, the absolute values of the scores do not affect the AUC. Only their order matters. 
Here, for RROC curves, the position of the mean error (the error bias) does not affect the AOC, only the dispersion of 
the error. 

This is a fundamental result as well because it is a graphical representation of the error variance, which can add 
to the applicability of RROC curves. The $n^2$ factor in theorem 7 also suggests that a scaled representation of RROC 
curves could be done by dividing both the x-axis and y-axis by n, i.e., plotting OVER/n against UNDER/n. This 
would make the curves independent of the number of examples, but the meaning of each point would be somewhat 
blurred, as the 'average over-estimation (or under-estimation) per example'. Nonetheless, this could be the standard 
representation in many applications, especially when the number of examples in the datasets may vary or we may even 
compare several models (or the same model) against different datasets (with different sizes). Figure 11 shows the same 
plot as Figure 5 but normalising by the number of examples (in this case n = 10). 
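
Under this normalisation the area over the curve is divided by $n^2$, so it becomes $\sigma^2/2$ regardless of the dataset size. A minimal sketch of this consequence, with hypothetical helper names and a hypothetical error vector:

```python
import statistics

def normalised_vertices(errors):
    """Finite RROC vertices with both coordinates divided by n."""
    n = len(errors)
    ordered = sorted(errors, reverse=True)
    points = []
    for pivot in ordered:
        over = sum(e - pivot for e in ordered if e > pivot)
        under = sum(e - pivot for e in ordered if e < pivot)
        points.append((over / n, under / n))
    return points

def area_over(points):
    """Trapezium sum over consecutive points (UNDER is non-positive)."""
    return sum(-(u1 + u2) / 2 * (o2 - o1)
               for (o1, u1), (o2, u2) in zip(points, points[1:]))

errors = [2.0, 0.5, -0.25, -1.0, -3.0]  # hypothetical error vector
normalised_aoc = area_over(normalised_vertices(errors))
half_variance = statistics.pvariance(errors) / 2
print(normalised_aoc, half_variance)
```

This makes normalised areas directly comparable across datasets of different sizes.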

5 ROC curves for non-constant shifts and soft regressors 

In classification, there are many possibilities for choosing the threshold [32]. In regression, there are many possibilities 
as well for the shift. Until now, we have considered that the shift is chosen as a constant. Other possibilities rely on the 
use of any function of the prediction and the operating condition. Figure 12 shows the model $m_1$ from example 1 using 
a constant shift, the same model using a third-degree polynomial, and the same model using a third-degree polynomial 
combining s and $\hat{y}$. As we can see in the figure, a different shift formula can reach 
places that the constant shift cannot. Actually, we can find functions such that the predictions are modified in such a 
way that they can attain any point on the RROC space. However, in order to get close to the RROC heaven, we would 
need very ad-hoc functions, basically embedding an error correction inside. 
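
To make the idea concrete, the sketch below traces RROC points for the three shift formulas quoted in the caption of Figure 12. It rests on several assumptions: the data are those of Example 4 (used as a stand-in, since Figure 12 is drawn for another model), the first data row of that example is read as the predictions, and the function and dictionary names are hypothetical.

```python
# Stand-in data taken from Example 4 (assumed rows: predictions, actual values).
predicted = [0.123, 1.221, 1.845, 4.573, 8.558, 7.392, 5.669, 1.578, 0.806, 1.245]
actual = [0.211, 2.725, 1.933, 3.242, 7.858, 6.061, 7.173, 3.082, 0.894, 1.203]

def rroc_point(transform, s):
    """(OVER, UNDER) after replacing each prediction by transform(y_hat, s)."""
    over = under = 0.0
    for y_hat, y in zip(predicted, actual):
        error = transform(y_hat, s) - y
        if error > 0:
            over += error
        else:
            under += error
    return over, under

# Shift formulas as written in the caption of Figure 12.
shift_methods = {
    "constant": lambda y_hat, s: y_hat + s,
    "polynomial": lambda y_hat, s: 0.9 * y_hat + 0.002 * y_hat ** 3 + s,
    "polynomial in s and y_hat":
        lambda y_hat, s: y_hat - 0.004 * y_hat ** 3 + 1.5 * s * y_hat ** 2 + s,
}

# One curve per method, obtained by sweeping s from -3 to 3.
curves = {name: [rroc_point(transform, k / 10) for k in range(-30, 31)]
          for name, transform in shift_methods.items()}
for name, curve in curves.items():
    print(name, curve[30])  # the point obtained with s = 0
```

Only the constant shift is guaranteed to produce a convex curve; the other two may reach points outside it.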

In general, we are interested in shift functions and methods that are systematic (i.e., a procedure which is the 
same for all models). Clearly, a constant shift is a systematic method, provided we find a way to find the appropriate 
constant for each operating condition. Recently, a method to find the appropriate shift for each operating condition 
(asymmetry) has been introduced [1]. Simply, given a value of $\alpha$, the method calculates the best shift for the training 
set. Then, this shift can be applied to the test set. This method does not obtain the optimal shift for every $\alpha$, but if the 
training set and test set are similar, the approximation can be good. By ranging over operating conditions $\alpha$ (instead 
of shifts), and using this method, we can construct a RROC curve which does not show the evolution of OVER and 
UNDER for an optimal (or ideal) shift choice method, but an actual, feasible one. This reinforces the view expressed 
by [32] for classification: we evaluate pairs of models and threshold choice methods. The translation to regression and 
RROC curves is that we plot models assuming a shift choice method (both threshold choice methods and shift choice 
methods are types of reframing methods). In the previous sections, we have assumed an optimal constant-shift choice 
method, but many other options exist and may lead to other curves for the same model (which are not necessarily 
convex). The good thing about the RROC space is that we can visualise several options in the same plot, as done with 
Figure 12, and evaluate both models and shift choice methods at the same time. 

Figure 11: Model $m_1$ as in example 1 drawn as a normalised RROC curve, using a normalised scale for the x-axis and 
y-axis (n = 10 examples). 

Figure 12: RROC curve of model $m_1$ from example 1 using a constant shift as usual: $\hat{y} + s$ (red), the same model using 
a third-degree polynomial, as $0.9 \cdot \hat{y} + 0.002 \cdot \hat{y}^3 + s$ (purple), and the same model using a third-degree polynomial 
combining s and $\hat{y}$, as $\hat{y} - 0.004 \cdot \hat{y}^3 + 1.5 \cdot s \cdot \hat{y}^2 + s$ (orange). 
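
A constant-shift choice method of this kind can be sketched as follows. This is not the implementation of [1]; it is a minimal illustration which assumes that the asymmetric absolute loss weights under-estimation by $\alpha$ and over-estimation by $1 - \alpha$ (up to a constant factor), and all names and data are hypothetical.

```python
def asymmetric_loss(errors, shift, alpha):
    """Total loss after adding `shift` to every prediction (assumed weighting)."""
    over = sum(e + shift for e in errors if e + shift > 0)
    under = sum(e + shift for e in errors if e + shift < 0)
    return (1 - alpha) * over + alpha * (-under)

def best_constant_shift(train_errors, alpha):
    """Shift minimising the loss on the training errors.

    The loss is piecewise linear and convex in the shift, so it is enough to
    try the shifts that zero one of the training errors.
    """
    candidates = [-e for e in train_errors]
    return min(candidates, key=lambda s: asymmetric_loss(train_errors, s, alpha))

train_errors = [2.0, 0.5, -0.25, -1.0, -3.0]  # hypothetical training errors
test_errors = [1.5, 0.2, -0.4, -2.0]          # hypothetical test errors

for alpha in (0.1, 0.5, 0.9):
    shift = best_constant_shift(train_errors, alpha)
    print(alpha, shift, asymmetric_loss(test_errors, shift, alpha))
```

The shift is chosen on the training errors only and then applied unchanged to the test errors, so the resulting test loss need not be optimal.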

Overall, there are many shift choice methods to be explored. For instance, [56] generalise the constant-shift choice 
method from [1] by using any polynomial function. A different, and more powerful, perspective is introduced by [28], 
where instead of using crisp models, the regression model attaches a standard deviation to each prediction. This 
standard deviation is used to better adjust the shift according to each example, which is now a function of two variables 
instead of one. The adjustment is found by risk minimisation. 

This also suggests the exploration of the connection between RROC curves and their corresponding cost curves. 

Definition 10. The cost space for regression is defined as a plot where the expected loss (e.g., the asymmetric absolute 
loss) is shown on the y-axis for a range of operating conditions (e.g., the asymmetry $\alpha$). 
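
A cost curve in this sense can be tabulated with a few lines of code. The sketch below is an illustration under assumptions: the loss weights under-estimation by $\alpha$ and over-estimation by $1 - \alpha$, the error vector is hypothetical, and two shift choice methods are compared (no shift, and the best constant shift for the same data).

```python
def mean_asymmetric_loss(errors, alpha):
    """Average loss per example under the assumed asymmetric weighting."""
    return sum((1 - alpha) * e if e > 0 else alpha * -e for e in errors) / len(errors)

errors = [2.0, 0.5, -0.25, -1.0, -3.0]  # hypothetical error vector
alphas = [k / 10 for k in range(11)]    # operating conditions on the x-axis

# Cost curve with no adjustment of the predictions.
no_shift = [mean_asymmetric_loss(errors, a) for a in alphas]

# Cost curve when, for each alpha, the best constant shift is applied. The loss
# is piecewise linear in the shift, so candidate shifts are those zeroing an error.
best_shift = [
    min(mean_asymmetric_loss([e + s for e in errors], a) for s in (-x for x in errors))
    for a in alphas
]

for a, plain, tuned in zip(alphas, no_shift, best_shift):
    print(a, round(plain, 4), round(tuned, 4))
```

Plotting both lists against `alphas` gives two curves in cost space, with the tuned curve never above the unadjusted one.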

Figure 13 shows this cost space, which is similar to the cost space of Drummond and Holte's cost curves for 
classification^]. The investigation of the mapping between the regression cost space and the RROC space can lead to 
new important findings as has been recently done for classification [31]. 

6 Concluding remarks 

We said in the introduction that there is no such thing as the 'canonical' ROC space for regression, corresponding 
exactly to the ROC space for classification, since regression and classification are very different tasks. Having said this, 
we think that the RROC space, curves and analysis that we have introduced in this paper present so many parallelisms 
and share so many notions and procedures, that their curves could reasonably be called the ROC curves for regression, 
with arguably more support than other previous attempts. We have seen that the notions of operating condition, cost 
asymmetry, RROC space, points, segments, RROC heaven, RROC isometrics, hybrid models, convexity, dominance, 
convex hull, curves, shift choice methods, etc., derive smoothly and work almost the same as in the classification case, 
so practitioners who are used to ROC curves can directly apply their expertise on ROC analysis to regression 
quite easily. 

There are naturally several issues which could lead to more general (or slightly different) notions of RROC curve 
for regression, keeping the same basic structure. The first issue that could be explored and generalised is the very notion 
of operating condition. We have only considered the asymmetry while, in classification, the class distribution can also 
be integrated (along with the cost proportion) in what is usually referred to as skew. In regression, the distribution of 
the output value (and not only the loss asymmetry) may also be considered part of the operating condition. 
This integration does not seem to be direct, but it is worth investigating. 

A second issue is the use of other loss functions. For instance, instead of an asymmetric absolute error, we could 
use an asymmetric squared error Quad-Quad. We guess that this would lead to non-straight isometrics and non-straight 
segments in the RROC curve, but the basic ideas would remain. Again, plotting different isometrics in RROC space 
for many different loss functions (Lin-Lin, Quad-Quad, Lin-Exp, Quad-Exp, etc.) would be a work on its own, very 
much resembling the celebrated paper [19] on isometrics for ROC curves in classification. 

A third important avenue of future work would be to further investigate the connection with the error variance we 
have unveiled here and to analyse the relation of AOC to other metrics, as well as the relation of RROC space with 
other plots to analyse the performance of regression. We think that RROC curves represent the expected loss for a 
range of operating conditions on one side, and the distribution of the error on the other side. There may be important 
connections to be unveiled between regression techniques trying to minimise the error variance (which we have shown 
here to be equal to the AOC) instead of squared error and those classification techniques trying to maximise the AUC 
(which has recently been shown to be equivalent to the refinement loss term of the MSE decomposition using the ROC 
curve [32]) instead of accuracy [12] [15]. So we anticipate a plethora of connections between RROC curves and many 
other performance metrics in regression, as has been done for classification in the past years [17, 26, 22, 30, 32]. 

Figure 13: A cost curve ([28]) showing the absolute loss (definition 2) against the operating condition $\alpha$ 
(asymmetry) of a model using several shift choice methods (reframing methods). The 'None' method represents no adjustment 
and corresponds to a single point in RROC space. The 'CoSh' and 'PoSh' are the constant-shift choice method from 
[1] and the polynomial-shift choice method from [56]. Finally, the 'Own', 'uKNC and 'BIN' methods are probabilistic 
reframing methods based on soft regressors using a two-parameter output for each instance and a risk-minimisation 
solution. Note that the optimal constant-shift choice method is not shown here, but it should fall below the 'CoSh' and 
'PoSh' methods. 

Overall, we think that RROC curves could become a fundamental tool in the assessment, improvement and 
deployment of regression models. In order to facilitate their use in real applications, we have developed a library for 
plotting RROC curves, calculating their areas and deriving their convex hulls. The software, in R [43], is available at 
http://users.dsic.upv.es/~jorallo/RROC/. The availability of software, the ubiquitous appearance of 
asymmetric losses in regression applications, and the success of ROC analysis for classification in the past decades 
suggest that RROC curves may soon become mainstream in all the areas where ROC analysis has proven useful: 
medicine, bioinformatics, decision making, statistics, machine learning and pattern recognition. 